library("knitr") # for knitting RMarkdown
library("kableExtra") # for nice RMarkdown tables
library("tidybayes") # tidying up results from Bayesian models
library("lme4") # for linear mixed effects models
library("brms") # Bayesian regression models with Stan
library("car") # for bootstrapping regression models
library("broom") # for tidy regression results
library("janitor") # for cleaning up variable names
library("patchwork") # for figure panels
library("ggeffects") # for visualizing estimated marginal means
library("stargazer") # for latex regression tables
library("sjPlot") # for nice RMarkdown regression tables
library("xtable") # for latex tables
library("tidyverse") # for wrangling, plotting, etc.
“Regression diagnostics are methods for determining whether a fitted regression model adequately represents the data.” (p. 385) (Fox and Weisberg 2018)
Let’s illustrate via the credit card balance data set.
Let’s fit a linear regression and visualize the residuals.
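The fitting code isn't shown here; presumably it looks something like the following sketch, where `df.credit` is assumed to hold the (cleaned) credit card balance data with columns `balance`, `income`, and `married`:

```r
# sketch: fit the linear model whose summary output is shown below
# (df.credit is assumed to be the cleaned credit card balance data set)
fit.credit = lm(formula = balance ~ 1 + income + married,
                data = df.credit)
summary(fit.credit)
```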
#>
#> Call:
#> lm(formula = balance ~ 1 + income + married, data = df.credit)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -816.96 -350.50 -50.68 331.67 1087.53
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 258.8819 41.4313 6.248 1.07e-09 ***
#> income 6.0587 0.5803 10.441 < 2e-16 ***
#> marriedYes -20.9546 41.9260 -0.500 0.617
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 408.2 on 397 degrees of freedom
#> Multiple R-squared: 0.2155, Adjusted R-squared: 0.2115
#> F-statistic: 54.52 on 2 and 397 DF, p-value: < 2.2e-16
income is a significant predictor of balance, but whether or not a person is married is not.
Let’s look at the residuals:
fit.credit %>%
augment() %>%
clean_names() %>%
ggplot(data = .,
mapping = aes(x = fitted,
y = resid)) +
geom_point()
This plot helps assess whether there is homogeneity of variance. Overall, the residual plot looks pretty okay. The diagonal points in the bottom left of the plot arise because credit card balance is not an unbounded variable, and some people have a credit card balance of 0.
We can also check whether the residuals are normally distributed by plotting a density of the residuals, and a quantile-quantile (QQ) plot.
df.plot = fit.credit %>%
augment() %>%
clean_names()
p1 = ggplot(data = df.plot,
mapping = aes(x = resid)) +
geom_density() +
labs(title = "Density plot")
p2 = ggplot(data = df.plot,
mapping = aes(sample = scale(resid))) +
geom_qq_line() +
geom_qq() +
labs(title = "QQ plot",
x = "theoretical",
y = "standardized residuals")
p1 + p2
The residuals aren’t really normally distributed. As both the density and the QQ plot show, residuals with low/negative values are more frequent than residuals with high/positive values.
We can also inspect residual plots by calling plot() on the model fit.
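A minimal sketch, using the `fit.credit` model from above:

```r
# base R diagnostic plots for a fitted lm: residuals vs. fitted, QQ plot,
# scale-location, and residuals vs. leverage
par(mfrow = c(2, 2))
plot(fit.credit)
par(mfrow = c(1, 1))
```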
Because linear regression models are fitted by minimizing the squared error between prediction and data, the results can be strongly influenced by outliers. There are a number of ways of checking for outliers.
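One quick check (a sketch, not part of the original analysis) is to look at the largest studentized residuals; values far beyond ±3 deserve a closer look:

```r
# sketch: inspect the five largest (absolute) studentized residuals
# of the fit.credit model from above
head(sort(abs(rstudent(fit.credit)), decreasing = TRUE), n = 5)
```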
Data points that are far from the center of the predictor space have potentially greater influence on the results – these points have high leverage.
Hat-values are a way of characterizing how much leverage individual data points have. Hat-values are bounded between 1/n and 1, and the sum of all hat-values equals the number of coefficients in the model (including the intercept). If a data point has high influence, this means that the results would change considerably if this data point were removed.
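In R, hat-values come straight from hatvalues(); a quick sketch using the model from above, which also verifies the bound and sum properties just mentioned:

```r
# sketch: hat-values of the fit.credit model (intercept, income, married)
h = hatvalues(fit.credit)
range(h)     # each hat-value lies between 1/n and 1
sum(h)       # sums to the number of coefficients (here 3, incl. intercept)
which.max(h) # the case with the highest leverage
```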
Based on this result, it looks like case 324 has a high influence on the predictions. Let’s see what the results would look like with, and without that data point.
# fit model without the data point of interest
fit.credit2 = update(fit.credit,
data = df.credit %>%
filter(x1 != 324))
res_with_outlier = fit.credit %>%
augment() %>%
filter(row_number() == 324) %>%
pull(.resid)
res_without_outlier = fit.credit2 %>%
augment(newdata = df.credit) %>%
mutate(.resid = balance - .fitted) %>%
filter(row_number() == 324) %>%
pull(.resid)
hat1 = (1 - res_with_outlier/res_without_outlier) %>%
round(3)
hat2 = fit.credit %>%
augment() %>%
filter(row_number() == 324) %>%
pull(.hat) %>%
round(3)
print(str_c("hat1: ", hat1))
#> [1] "hat1: 0.042"
print(str_c("hat2: ", hat2))
#> [1] "hat2: 0.042"
Cook’s distance is defined as
\[ D_i = \frac{e^2_{Si}}{k + 1} \times \frac{h_i}{1-h_i}\],
where \(e^2_{Si}\) is the squared standardized residual, \(k\) is the number of coefficients in the model (excluding the intercept), and \(h_i\) is the hat-value for case \(i\).
Let’s double check here:
fit.credit %>%
augment() %>%
mutate(cook = ((.std.resid^2)/(2 + 1)) * (.hat/(1 - .hat))) %>%
select(contains("cook")) %>%
head(10)
#> # A tibble: 10 x 2
#> .cooksd cook
#> <dbl> <dbl>
#> 1 0.000000289 0.000000289
#> 2 0.0000119 0.0000119
#> 3 0.00281 0.00281
#> 4 0.00238 0.00238
#> 5 0.000519 0.000519
#> 6 0.00308 0.00308
#> 7 0.000510 0.000510
#> 8 0.000530 0.000530
#> 9 0.0000842 0.0000842
#> 10 0.00500 0.00500
Looking good!
fit.credit %>%
augment() %>%
ggplot(aes(x = .hat,
y = .std.resid)) +
geom_point() +
geom_line(aes(y = .cooksd),
color = "red")
df.test = tibble(x = runif(n = 10),
y = x * 2 + rnorm(10, sd = 2)) %>%
bind_rows(tibble(x = 0.9,
y = -30)) %>%
mutate(index = 1:n())
fit.test = lm(y ~ x,
df.test)
fit.test %>%
augment() %>%
mutate(index = 1:n()) %>%
ggplot(aes(x = .hat,
y = .std.resid)) +
geom_point() +
geom_line(aes(y = .cooksd),
color = "red") +
geom_text(aes(label = index),
nudge_y = -0.2)
p1 = ggplot(data = df.un,
mapping = aes(x = infant_mortality)) +
geom_density()
# log transformed
p2 = ggplot(data = df.un,
mapping = aes(x = log(infant_mortality))) +
geom_density()
p3 = ggplot(data = df.un,
mapping = aes(x = ppgdp)) +
geom_density()
# log transformed
p4 = ggplot(data = df.un,
mapping = aes(x = log(ppgdp))) +
geom_density()
p1 + p2 + p3 + p4 +
plot_layout(nrow = 2)
fit.mortality1 = lm(formula = infant_mortality ~ ppgdp,
data = df.un)
fit.mortality2 = lm(formula = log(infant_mortality) ~ log(ppgdp),
data = df.un)
fit.mortality3 = lm(formula = log(infant_mortality) ~ ppgdp,
data = df.un)
fit.mortality4 = lm(formula = infant_mortality ~ log(ppgdp),
data = df.un)
summary(fit.mortality1)
#>
#> Call:
#> lm(formula = infant_mortality ~ ppgdp, data = df.un)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -31.48 -18.65 -8.59 10.86 83.59
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 41.3780016 2.2157454 18.675 < 2e-16 ***
#> ppgdp -0.0008656 0.0001041 -8.312 1.73e-14 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 25.13 on 191 degrees of freedom
#> Multiple R-squared: 0.2656, Adjusted R-squared: 0.2618
#> F-statistic: 69.08 on 1 and 191 DF, p-value: 1.73e-14
#>
#> Call:
#> lm(formula = log(infant_mortality) ~ log(ppgdp), data = df.un)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.16789 -0.36738 -0.02351 0.24544 2.43503
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 8.10377 0.21087 38.43 <2e-16 ***
#> log(ppgdp) -0.61680 0.02465 -25.02 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.5281 on 191 degrees of freedom
#> Multiple R-squared: 0.7662, Adjusted R-squared: 0.765
#> F-statistic: 625.9 on 1 and 191 DF, p-value: < 2.2e-16
#>
#> Call:
#> lm(formula = log(infant_mortality) ~ ppgdp, data = df.un)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.61611 -0.48094 -0.07858 0.53930 2.17745
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.479e+00 6.537e-02 53.23 <2e-16 ***
#> ppgdp -4.595e-05 3.072e-06 -14.96 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.7413 on 191 degrees of freedom
#> Multiple R-squared: 0.5394, Adjusted R-squared: 0.537
#> F-statistic: 223.7 on 1 and 191 DF, p-value: < 2.2e-16
#>
#> Call:
#> lm(formula = infant_mortality ~ log(ppgdp), data = df.un)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -38.239 -11.609 -2.829 8.122 82.183
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 155.7698 7.2431 21.51 <2e-16 ***
#> log(ppgdp) -14.8617 0.8468 -17.55 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 18.14 on 191 degrees of freedom
#> Multiple R-squared: 0.6172, Adjusted R-squared: 0.6152
#> F-statistic: 308 on 1 and 191 DF, p-value: < 2.2e-16
#> bcPower Transformation to Normality
#> Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
#> Y1 0.0945 0 -0.0435 0.2326
#>
#> Likelihood ratio test that transformation parameter is equal to 0
#> (log transformation)
#> LRT df pval
#> LR test, lambda = (0) 1.809774 1 0.17854
#>
#> Likelihood ratio test that no transformation is needed
#> LRT df pval
#> LR test, lambda = (1) 146.7235 1 < 2.22e-16
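The Box-Cox output above is presumably produced with powerTransform() from the car package; a sketch (assuming `df.un` with its `infant_mortality` column, as above):

```r
# sketch: estimate the Box-Cox transformation parameter for the response;
# summary() reports the estimated power and the LR tests shown above
library("car")
summary(powerTransform(infant_mortality ~ 1, data = df.un))
```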
#> Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.
#> Model has log-transformed predictors. Consider using `terms="ppgdp [exp]"` to back-transform scale.
#> Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.
# with log transforms
fit.mortality5 = lm(formula = log(infant_mortality) ~ log(ppgdp) + group,
data = df.un)
# without log transforms
fit.mortality6 = lm(formula = infant_mortality ~ ppgdp + group,
data = df.un)
p1 = ggpredict(fit.mortality5,
terms = c("ppgdp [exp]", "group")) %>%
plot() +
labs(title = "Prediction with log transform") +
coord_cartesian(xlim = c(0, 20000))
p2 = ggpredict(fit.mortality6,
terms = c("ppgdp", "group")) %>%
plot() +
labs(title = "Prediction without log transform") +
coord_cartesian(xlim = c(0, 20000))
p1 + p2
This section is based on this post here.
# make reproducible
set.seed(1)
n = 250
df.turkey = tibble(turkey_time = runif(n = n, min = 0, max = 50),
nap_time = 500 + turkey_time ^ 2 + rnorm(n, sd = 16))
ggplot(data = df.turkey,
mapping = aes(x = turkey_time,
y = nap_time)) +
geom_smooth(method = "lm") +
geom_point()
#> `geom_smooth()` using formula 'y ~ x'
A simple linear regression doesn’t fit the data well (not surprising, since we included a squared predictor).
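The summary output that follows presumably comes from this simple linear fit (a sketch, using `df.turkey` as simulated above):

```r
# sketch: linear model with only the linear turkey_time term
fit.turkey = lm(formula = nap_time ~ 1 + turkey_time,
                data = df.turkey)
summary(fit.turkey)
```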
#>
#> Call:
#> lm(formula = nap_time ~ 1 + turkey_time, data = df.turkey)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -212.82 -146.78 -55.17 125.74 462.52
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 15.4974 23.3827 0.663 0.508
#> turkey_time 51.5746 0.8115 63.557 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 172.4 on 248 degrees of freedom
#> Multiple R-squared: 0.9422, Adjusted R-squared: 0.9419
#> F-statistic: 4039 on 1 and 248 DF, p-value: < 2.2e-16
#> # A tibble: 2 x 7
#> term estimate std.error statistic p.value conf.low conf.high
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 15.5 23.4 0.663 5.08e- 1 -30.6 61.6
#> 2 turkey_time 51.6 0.811 63.6 1.72e-155 50.0 53.2
As the plot above shows, a simple linear model doesn’t predict the data well.
A regression with a squared predictor would fit well:
fit.turkey2 = lm(formula = nap_time ~ 1 + I(turkey_time ^ 2),
data = df.turkey)
summary(fit.turkey2)
#>
#> Call:
#> lm(formula = nap_time ~ 1 + I(turkey_time^2), data = df.turkey)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -45.611 -9.911 -0.652 11.137 43.008
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4.994e+02 1.575e+00 317.0 <2e-16 ***
#> I(turkey_time^2) 1.001e+00 1.439e-03 695.3 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 16.23 on 248 degrees of freedom
#> Multiple R-squared: 0.9995, Adjusted R-squared: 0.9995
#> F-statistic: 4.835e+05 on 1 and 248 DF, p-value: < 2.2e-16
fit.turkey2 %>%
augment() %>%
clean_names() %>%
ggplot(data = .,
mapping = aes(x = i_turkey_time_2,
y = nap_time)) +
geom_line(mapping = aes(y = fitted),
color = "blue") +
geom_point()
#>
#> Number of bootstrap replications R = 999
#> original bootBias bootSE bootMed
#> (Intercept) 15.497 -1.130589 29.330 15.080
#> turkey_time 51.575 0.023717 1.058 51.625
#> Bootstrap bca confidence intervals
#>
#> 2.5 % 97.5 %
#> (Intercept) -41.94541 75.90403
#> turkey_time 49.26083 53.45920
#> Bootstrap percent confidence intervals
#>
#> 2.5 % 97.5 %
#> (Intercept) -47.63914 71.63622
#> turkey_time 49.35224 53.60821
We see that the confidence intervals using the bootstrap method are wider than the ones that use the linear regression model.
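The bootstrap output above is consistent with car's Boot() function; the underlying idea can also be sketched by hand in base R (using `df.turkey` and the simple linear model from above):

```r
# sketch: hand-rolled nonparametric bootstrap of the regression coefficients
set.seed(1)
boot_coefs = replicate(999, {
  # resample rows with replacement, refit, and keep the coefficients
  idx = sample(nrow(df.turkey), replace = TRUE)
  coef(lm(nap_time ~ 1 + turkey_time, data = df.turkey[idx, ]))
})
# percentile confidence intervals for intercept and slope
apply(boot_coefs, 1, quantile, probs = c(0.025, 0.975))
```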
To export regression results into LaTeX, I recommend the xtable and stargazer packages. The xtable package is nice and simple, whereas the stargazer package is more involved and allows for a lot of customization.
To show nice regression tables in RMarkdown, I recommend the sjPlot package.
Dependent variable: log(infant mortality)

| Predictors | Estimates | CI | Statistic | p |
|---|---|---|---|---|
| (Intercept) | 6.26 | 5.61 – 6.90 | 19.25 | <0.001 |
| ppgdp [log] | -0.46 | -0.52 – -0.40 | -15.28 | <0.001 |
| group [other] | 0.48 | 0.26 – 0.70 | 4.36 | <0.001 |
| group [africa] | 1.05 | 0.76 – 1.34 | 7.15 | <0.001 |
| Observations | 193 | | | |
| R² / R² adjusted | 0.819 / 0.816 | | | |
Information about this R session, including which version of R was used and what packages were loaded:
#> R version 3.6.2 (2019-12-12)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Mojave 10.14.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] forcats_0.5.0 stringr_1.4.0 dplyr_0.8.4 purrr_0.3.3
#> [5] readr_1.3.1 tidyr_1.0.2 tibble_2.1.3 ggplot2_3.3.0
#> [9] tidyverse_1.3.0 xtable_1.8-4 sjPlot_2.8.2 stargazer_5.2.2
#> [13] ggeffects_0.14.1 patchwork_1.0.0 janitor_1.2.1 broom_0.5.5
#> [17] car_3.0-6 carData_3.0-3 brms_2.12.0 Rcpp_1.0.3
#> [21] lme4_1.1-21 Matrix_1.2-18 tidybayes_2.0.1 kableExtra_1.1.0
#> [25] knitr_1.28
#>
#> loaded via a namespace (and not attached):
#> [1] readxl_1.3.1 backports_1.1.5 plyr_1.8.6
#> [4] igraph_1.2.4.2 splines_3.6.2 svUnit_0.7-12
#> [7] crosstalk_1.0.0 TH.data_1.0-10 rstantools_2.0.0
#> [10] inline_0.3.15 digest_0.6.25 htmltools_0.4.0
#> [13] rsconnect_0.8.16 fansi_0.4.1 magrittr_1.5
#> [16] openxlsx_4.1.4 modelr_0.1.6 matrixStats_0.55.0
#> [19] sandwich_2.5-1 xts_0.12-0 prettyunits_1.1.1
#> [22] colorspace_1.4-1 rvest_0.3.5 haven_2.2.0
#> [25] xfun_0.12 jsonlite_1.6.1 callr_3.4.2
#> [28] crayon_1.3.4 survival_3.1-8 zoo_1.8-7
#> [31] glue_1.3.1 gtable_0.3.0 emmeans_1.4.5
#> [34] webshot_0.5.2 sjstats_0.17.9 sjmisc_2.8.3
#> [37] pkgbuild_1.0.6 rstan_2.19.3 abind_1.4-5
#> [40] scales_1.1.0 mvtnorm_1.1-0 DBI_1.1.0
#> [43] miniUI_0.1.1.1 viridisLite_0.3.0 performance_0.4.4
#> [46] foreign_0.8-76 stats4_3.6.2 StanHeaders_2.21.0-1
#> [49] DT_0.12 htmlwidgets_1.5.1 httr_1.4.1
#> [52] threejs_0.3.3 arrayhelpers_1.1-0 ellipsis_0.3.0
#> [55] farver_2.0.3 pkgconfig_2.0.3 loo_2.2.0
#> [58] dbplyr_1.4.2 utf8_1.1.4 labeling_0.3
#> [61] tidyselect_1.0.0 rlang_0.4.5 reshape2_1.4.3
#> [64] later_1.0.0 effectsize_0.2.0 munsell_0.5.0
#> [67] cellranger_1.1.0 tools_3.6.2 cli_2.0.2
#> [70] generics_0.0.2 sjlabelled_1.1.3 ggridges_0.5.2
#> [73] evaluate_0.14 fastmap_1.0.1 yaml_2.2.1
#> [76] fs_1.3.2 processx_3.4.2 zip_2.0.4
#> [79] nlme_3.1-145 mime_0.9 xml2_1.2.2
#> [82] compiler_3.6.2 bayesplot_1.7.1 shinythemes_1.1.2
#> [85] rstudioapi_0.11 curl_4.3 reprex_0.3.0
#> [88] stringi_1.4.6 ps_1.3.2 parameters_0.5.0
#> [91] Brobdingnag_1.2-6 lattice_0.20-40 nloptr_1.2.1
#> [94] markdown_1.1 shinyjs_1.1 vctrs_0.2.3
#> [97] pillar_1.4.3 lifecycle_0.1.0 bridgesampling_1.0-0
#> [100] estimability_1.3 data.table_1.12.8 insight_0.8.1
#> [103] httpuv_1.5.2 R6_2.4.1 bookdown_0.18
#> [106] promises_1.1.0 gridExtra_2.3 rio_0.5.16
#> [109] codetools_0.2-16 boot_1.3-24 colourpicker_1.0
#> [112] MASS_7.3-51.5 gtools_3.8.1 assertthat_0.2.1
#> [115] withr_2.1.2 shinystan_2.5.0 multcomp_1.4-12
#> [118] mgcv_1.8-31 bayestestR_0.5.2 parallel_3.6.2
#> [121] hms_0.5.3 grid_3.6.2 coda_0.19-3
#> [124] minqa_1.2.4 snakecase_0.11.0 rmarkdown_2.1
#> [127] lubridate_1.7.4 shiny_1.4.0 base64enc_0.1-3
#> [130] dygraphs_1.1.1.6
Fox, John, and Sanford Weisberg. 2018. An R Companion to Applied Regression. Sage publications.